Abstract. The selection problem, where one wishes to locate the $k$-th smallest element in an unsorted array of size $n$, is one of the basic problems studied in computer science. The main focus of this work is designing algorithms for solving the selection problem in the presence of memory faults. These can happen as the result of cosmic rays, alpha particles, or hardware failures.

Specifically, the computational model assumed here is a faulty variant of the RAM model (abbreviated as FRAM), which was introduced by Finocchi and Italiano [FI04]. In this model, the content of memory cells might get corrupted adversarially during the execution, and the algorithm cannot distinguish between corrupted cells and uncorrupted cells. The model assumes a constant number of reliable memory cells that never become corrupted, and an upper bound $\delta$ on the number of corruptions that may occur, which is given as an auxiliary input to the algorithm. An output element is correct if it has rank between $k - \alpha$ and $k + \alpha$ in the input array, where $\alpha$ is the number of corruptions that occurred during the execution of the algorithm. An algorithm is called resilient if it always outputs a correct answer.

The main contribution of this work is a deterministic resilient selection algorithm with optimal $O(n)$ worst-case running time. Interestingly, the running time does not depend on the number of faults, and the algorithm does not need to know $\delta$. As part of the solution, several techniques are developed that sometimes allow non-tail-recursive algorithms to be used in the FRAM model. Using recursive algorithms in this model is problematic, as the recursion stack might be too large to fit in reliable memory.

The aforementioned resilient selection algorithm can be used to improve the complexity bounds for resilient $k$-d trees developed by Gieseke, Moruz and Vahrenhold [GMV10]. Specifically, the time complexity for constructing a $k$-d tree is improved from $O(n \log^2 n + \delta^2)$ to $O(n \log n)$.

Besides the deterministic algorithm, a randomized resilient selection algorithm is developed, which is simpler than the deterministic one, and has $O(n + \alpha)$ expected time complexity and $O(1)$ space complexity (i.e., is in-place). This algorithm is used to develop the first resilient sorting algorithm that is in-place and achieves optimal $O(n \log n + \alpha\delta)$ expected running time.



1 Introduction 



Computing devices are becoming smaller and faster. As a result, the likelihood of soft memory errors (which are not caused by permanent failures) increases. In fact, a recent practical survey [Sem04] concludes that a few thousand soft errors per billion hours per megabit is fairly typical, which would imply roughly one soft error every five hours on a modern PC with 24 gigabytes of memory [CDK11]. The causes of these soft errors vary and include cosmic rays [Bau05], alpha particles [MW79], or hardware failures [LHSC10].

1.1 The Faulty RAM Model 

To deal with these faults, the faulty RAM (FRAM) model was proposed by Finocchi and Italiano [FI04], and has received some attention [BFF+07, BJM09, BJMM09, CFFS11, FGI09a, FGI09b, GMV10, JMM07]. In this model, an upper bound on the number of corruptions, denoted by $\delta$, is given to the algorithm, while the actual number of faults is denoted by $\alpha$ ($\alpha \le \delta$). Memory cells may become corrupted at any time during an algorithm's execution, and the algorithm cannot distinguish between corrupted cells and uncorrupted cells. The same memory cell may become corrupted multiple times during a single execution of an algorithm. In addition, the model assumes the existence of $O(1)$ reliable memory cells, which are needed, for example, to reliably store the code itself. A cell is assumed to contain $\Theta(\log n)$ bits, where $n$ is the size of the input, as is usual in the RAM model.

One of the interesting aspects of developing algorithms in the FRAM model is 
that the notion of correctness is not always clear. Usually, correctness is defined 
with respect to the subset of uncorrupted memory cells and in a worst-case sense, 
implying that for an algorithm to be correct, it must be correct in the presence 
of any faulty environment, including an adversarial environment. For example, 
in the sorting problem the goal is to order the input elements such that the 
uncorrupted subset of the array is guaranteed to be sorted [FI04]. In the FRAM
model, an algorithm that is always correct (which is problem dependent) is called 
resilient. 

A naive way of implementing a resilient algorithm is by storing $2\delta + 1$ copies of every piece of data. Writing is done by writing the same value to all copies, and reading is done by computing the majority of the copies. Using this technique, most (if not all¹) non-resilient algorithms can be made resilient with $O(\delta)$ multiplicative overhead in time and space complexity.
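The replication scheme just described can be sketched as follows. This is only an illustration under stated assumptions: one Python value per logical cell, corruptions simulated by overwriting entries of the `copies` list, and a hypothetical class name `ReplicatedCell`. The majority is computed with the Boyer-Moore majority vote, a choice made for this sketch because it needs only $O(1)$ extra variables; since at most $\delta$ of the $2\delta + 1$ copies can be corrupted, the last written value is always a strict majority, and every access costs $O(\delta)$ time.

```python
class ReplicatedCell:
    """Naive resilient storage of one value using 2*delta + 1 copies.

    Corruptions are simulated by overwriting entries of `copies` directly;
    as long as at most `delta` copies differ from the last written value,
    that value is still a strict majority of the copies.
    """

    def __init__(self, value, delta):
        self.copies = [value] * (2 * delta + 1)

    def write(self, value):
        # Writing: store the same value in every copy (O(delta) time).
        for i in range(len(self.copies)):
            self.copies[i] = value

    def read(self):
        # Reading: Boyer-Moore majority vote over the copies, O(delta) time
        # and O(1) extra variables (candidate and count).
        candidate, count = None, 0
        for v in self.copies:
            if count == 0:
                candidate, count = v, 1
            elif v == candidate:
                count += 1
            else:
                count -= 1
        return candidate


cell = ReplicatedCell(7, delta=2)       # 5 copies
cell.copies[0] = cell.copies[3] = -1    # simulate two corruptions
print(cell.read())                      # 7
```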

1.2 Previous Work 

A summary of the algorithms and data structures that have been developed in 
the FRAM model is given next. 

¹ This might not be true because it can depend on how correctness of the problem is defined in the FRAM model. For example, finding the exact $k$-order statistic is not achievable in this model, as explained in Section 2.
Resilient Searching: Finocchi and Italiano [FI04] and Finocchi, Grandoni and Italiano [FGI09a] developed an almost optimal resilient searching algorithm, which finds an element in a sorted array of size $n$ in $O(\log n + \delta^{1+\varepsilon})$ time, where uncorrupted elements are guaranteed to be found. The main idea is to perform a slow reliable verification step once in every $O(\delta)$ fast but unreliable binary search steps. A somewhat natural lower bound of $\Omega(\log n + \delta)$ was proven there as well. A matching upper bound was developed by Brodal, Fagerberg, Finocchi, Grandoni, Italiano, Jørgensen, Moruz and Mølhave [BFF+07], using a different method.

Resilient Dictionaries: The dynamic counterpart of searching is the dictionary data structure. An optimal resilient dictionary, supporting updates (insertions and deletions) and queries (searches) in $O(\log n + \delta)$ amortized time per operation, was developed by Brodal et al. [BFF+07]. Again, uncorrupted elements are guaranteed to be found.

Resilient Sorting: Finocchi et al. [FI04, FGI09a] developed a resilient sorting algorithm, sorting an array of size $n$ in $O(n \log n + \alpha\delta)$ time. The uncorrupted subset of the array is guaranteed to be sorted. The algorithm is an iterative version of Mergesort, with a resilient merging step. A matching (and somewhat surprising) lower bound was proven there as well.

Resilient Priority Queues: Another basic data structure, a resilient priority queue, was developed by Jørgensen, Moruz and Mølhave [JMM07]. The data structure supports insert and deletemin in $O(\log n + \delta)$ amortized time, where the deletemin operation returns either the minimum element among the uncorrupted elements, or a corrupted element. A matching lower bound was given there as well.

Resilient Counters: Brodal, Grønlund, Jørgensen, Moruz and Mølhave [BJMM09] developed several resilient counters, supporting increments and queries, where the result of a query is an $\alpha$-additive approximation to the number of increments performed until the query. While the proven lower bound of $\Omega(\delta)$ space and time is not achieved, several interesting tradeoffs are presented there.

Dynamic Programming: Caminiti, Finocchi and Fusco [CFF11] and Caminiti, Finocchi, Fusco and Silvestri [CFFS11] developed a resilient and cache-oblivious dynamic programming meta-algorithm, computing the correct answer with high probability, using $O(n^d + \delta^{d+1})$ time and $O(n^d + n\delta)$ space, where $d$ is the dimension of the dynamic programming table.

Resilient External Memory Algorithms: The problem of designing algorithms 
that are simultaneously cache efficient and resilient was addressed by Brodal, 
Jørgensen, Grønlund and Mølhave [BJM09]. They showed matching upper and lower bounds for a deterministic and randomized dictionary, a deterministic priority queue, and a deterministic sorting algorithm.
k-d Trees: The problem of $k$-means clustering in the presence of memory faults was addressed by Gieseke et al. [GMV10]. They developed a resilient $k$-d tree, supporting orthogonal range queries in $O(\sqrt{n\delta} + t)$ time, where $t$ is the size of the output.

1.3 Results 

Deterministic Resilient Selection Algorithm: The main focus of this work is on the selection problem (sometimes called the $k$-order statistic problem) in the FRAM model, where one wishes to locate the $k$-th smallest element in an unsorted array of size $n$, in the presence of memory faults. The following main theorem is proved in Section 4.

Theorem 1. There exists a deterministic resilient selection algorithm with time complexity $O(n)$.

Interestingly, the running time does not depend on the number of faults. Moreover, the algorithm does not need to know $\delta$ explicitly. The selection problem is a classic problem in computer science. Along with searching and sorting, it is one of the basic problems studied in the field, taught already at the undergraduate level (e.g., [CLRS09]). The $k$-order statistic of a set of samples is a basic concept in statistics as well (e.g., [ABN93]). There are numerous applications for the selection problem, thus devising efficient algorithms is of practical interest. The textbook algorithm by Blum, Floyd, Pratt, Rivest and Tarjan [BFP+73] achieves linear time complexity in the (non-faulty) RAM model.

When considering the selection problem in the FRAM model, the first difficulty is to define correctness². To this end, the correctness definition used here allows returning an element, which may even be corrupted, whose rank in the input array is between $k - \alpha$ and $k + \alpha$. Notice that when $\alpha = 0$, this definition coincides with the non-faulty definition (for a formal definition see Section 2).

Randomized Resilient Selection Algorithm: Besides the deterministic algorithm, a randomized and in-place counterpart is developed as well. A randomized algorithm in the FRAM model is an algorithm that can use random coins. The faults are still adversarial, but the adversary cannot see the random coins of the algorithm, and the algorithm must be correct with probability 1, regardless of the coin tosses. The randomized selection algorithm is simpler than the deterministic one, and is likely to beat the deterministic algorithm in practice. The following theorem is proven in Section 3.

Theorem 2. There exists a randomized in-place resilient selection algorithm with expected time complexity $O(n + \alpha)$.

² The common notion of considering only the non-corrupted elements is somewhat misleading in the selection problem, because the algorithm cannot distinguish between corrupted and uncorrupted data.
Resilient k-d Trees: The selection algorithm presented here can be used to improve the complexity bounds for resilient $k$-d trees developed by Gieseke et al. [GMV10]. There, a deterministic resilient algorithm for constructing a $k$-d tree with $O(n \log^2 n + \delta^2)$ time complexity is shown. This can be improved to $O(n \log n)$ by using the deterministic resilient selection algorithm developed here.

Theorem 3. There exists a resilient $k$-d tree which can be constructed in deterministic $O(n \log n)$ time. It supports resilient orthogonal range queries in $O(\sqrt{n\delta} + t)$ time for reporting $t$ points.

Resilient Quicksort Algorithms: The problem of sorting in the FRAM model is also revisited, as an application of the resilient selection algorithm. Finocchi et al. [FGI09a] already developed a resilient Mergesort algorithm, sorting an array of size $n$ in $O(n \log n + \alpha\delta)$ time, where the uncorrupted subset of the array is guaranteed to be sorted. They also proved that this bound is tight. In Section 7, a new in-place randomized sorting algorithm which resembles Quicksort and runs in $O(n \log n + \alpha\delta)$ expected time is presented. This sorting algorithm uses the randomized selection algorithm as a black box. The following theorem is proven in Section 7.

Theorem 4. There exists a resilient deterministic sorting algorithm with worst-case running time of $O(n \log n + \alpha\delta)$, and a resilient randomized in-place sorting algorithm with expected running time of $O(n \log n + \alpha\delta)$.

1.4 Recursion 

In the (non-faulty) RAM model the recursion stack needs to reliably store the local variables, as well as the frame pointer and the program counter. In the FRAM model, corruptions of this data can cause the algorithm to behave unexpectedly, and in general the recursion stack cannot fit in reliable memory. Some new techniques for implementing a specific recursion stack which suffices for solving the selection problem are developed in Section 5. These techniques are used to develop the resilient deterministic selection algorithm presented in Section 4. It is likely that these techniques can help implement recursive algorithms for other problems in the FRAM model. The main technique developed here, which allows non-tail recursion to be used in the FRAM model, is somewhat general, and can be applied because of the following four properties:

1. Easily Inverted Size Function: When performing a recursive call, the function which determines the size of the input to the recursive call can be easily inverted, while needing only $O(1)$ bits to maintain the data needed to perform the inversion.

2. Small Depth: The depth of the recursion is bounded by $O(\log n)$, so using $O(1)$ bits per level requires $O(\log n)$ bits in total, which fits in $O(1)$ reliable memory cells of $\Theta(\log n)$ bits each.

3. Verification: A linear-time verification procedure is used such that once a recursive call finishes, if the procedure accepts, then the algorithm may proceed even if some errors did occur in the recursive call. The main point here is that even though errors occurred, continuing onwards does not hurt the correctness.
4. Amortization: If the verification procedure fails, then the number of errors which caused the failure is linear in the amount of time spent on the recursive call (not counting other verification procedures that failed within it). This means that the amortized cost of each corruption is $O(1)$.
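The first two properties can be illustrated with a small sketch. It is not the recursion implementation of Section 5: the two child-size rules below are hypothetical stand-ins for the size functions of the selection algorithm, and the names `call_down`, `call_up` and `RecordStack` are made up for illustration. Each call stores a 4-bit record (the call type plus a small remainder), from which the parent's input size is recovered exactly when the call returns, so a stack of depth $O(\log n)$ occupies $O(\log n)$ bits, i.e., $O(1)$ cells of $\Theta(\log n)$ bits.

```python
# Hypothetical child-size rules, chosen only to illustrate the idea
# (they are NOT the paper's f_u):
#   tag 0 (first type):  child = ceil(m / 5),  aux = 5 * child - m, in [0, 4]
#   tag 1 (second type): child = m - m // 4,   aux = m % 4,         in [0, 3]
AUX_BITS = 3
WIDTH = 1 + AUX_BITS          # bits stored per recursion level


def call_down(m, tag):
    """Size of the child call plus the O(1)-bit record needed to undo it."""
    if tag == 0:
        child = -(-m // 5)    # ceil(m / 5)
        aux = 5 * child - m
    else:
        child = m - m // 4
        aux = m % 4
    return child, (tag << AUX_BITS) | aux


def call_up(child, record):
    """Invert call_down: recover the parent size from the child size."""
    tag, aux = record >> AUX_BITS, record & ((1 << AUX_BITS) - 1)
    if tag == 0:
        return 5 * child - aux
    # parent m = 4q + aux, hence child = 3q + aux
    return 4 * ((child - aux) // 3) + aux


class RecordStack:
    """Stack of WIDTH-bit records packed into a single integer."""

    def __init__(self):
        self.word, self.depth = 0, 0

    def push(self, record):
        self.word = (self.word << WIDTH) | record
        self.depth += 1

    def pop(self):
        record = self.word & ((1 << WIDTH) - 1)
        self.word >>= WIDTH
        self.depth -= 1
        return record


stack, size = RecordStack(), 1000
for tag in (1, 0, 1):                 # descend three levels
    size, record = call_down(size, tag)
    stack.push(record)
print(size)                           # 113
while stack.depth:                    # climb back using only the records
    size = call_up(size, stack.pop())
print(size)                           # 1000
```

Recovering sizes on the way up, instead of storing them, is what keeps the per-level cost at a constant number of bits.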

The only previous work in the FRAM model on non-tail recursion was done by Caminiti et al. [CFFS11], who developed a recursive algorithm for solving dynamic programming. However, the recursion inherent in dynamic programming is simpler than the recursion treated in the selection problem, due to the structural behavior of the dynamic programming table (the recursion depends on positions within the table, and not on the actual data). Moreover, their solution only works with high probability (due to using fingerprints for the verification procedure).



1.5 Related Work 



Other models and techniques to deal with memory corruptions do exist. Some 
of them are given here, with an emphasis on their relation to the FRAM model. 



Error Correcting Codes: The field of error correcting codes and error detecting 
codes deals with the problem of reliably transmitting a message over a faulty 
communication channel. This is achieved by adding redundancy to the message 
(e.g., checksums). For a survey, see, e.g., [PW72]. The solutions developed in this
field do not treat the implications of corruptions to the computation performed 
on the data. Therefore, applying these methods to the FRAM model in a non- 
naive way is not trivial. 



Error Correcting Memory: Error detecting and correcting codes can be implemented in the hardware itself (e.g., [CH84]). While this solution has its advantages, it imposes some costs in performance and money.

Pointer-Based Data Structures: Aumann and Bender [AB96] addressed the problem of losing data in pointer-based data structures due to pointer corruptions. The data structures suggested by them incur only a small overhead in space and time, and guarantee an upper bound on the amount of uncorrupted data that can be lost due to pointer corruptions. This is in contrast to the FRAM model, where no uncorrupted data is allowed to be lost.



Fault-Tolerant Parallel and Distributed Computation: Extensive research on fault tolerance has been done in the field of parallel and distributed computation. For a survey, see [G99]. The work done in this field deals with resiliency with respect to faulty processors or communication links, in contrast to the faulty memory which is assumed in the FRAM model. Some of this work assumes the existence of fault detection hardware, thereby allowing the system to distinguish between faulty and non-faulty data, unlike the FRAM model.
Checkers: Blum, Evans, Gemmell, Kannan and Naor [BEG+94] addressed the
problem of checking memory correctness in the presence of faults. In this model, 
the data structure is viewed as being controlled by an adversary. The goal of the 
checker, which is allowed to use a small amount of reliable memory, is to detect 
every deviation from the expected data, with high probability. In the FRAM 
model, the goal is not to detect the memory corruptions, but instead, to always 
behave correctly on the uncorrupted subset of the data. 

Fault-Tolerant Sorting Networks: Fault tolerance has been investigated in the context of sorting networks. Assaf and Upfal [AU91] developed a resilient sorting network, with an $O(\log n)$ multiplicative overhead in the size of the network. The computational model is a sorting network and not a general purpose machine, as in the FRAM model.

The Liar Model: In this model, the algorithm can access the data only through 
a noisy oracle. The algorithm queries the oracle and can possibly get a faulty 
answer (i.e., a lie). An upper bound on the number of these lies or a probability 
of a lie is assumed. See, e.g., [FRPU94] and [DGW92]. The data itself cannot get
corrupted, therefore, in this model, query replication strategies can be exploited, 
in contrast to the FRAM model. 

Other Noisy Computational Models: Several other noisy computational models have been investigated. Sherstov [She12] showed an optimal (in terms of degree) approximating polynomial that is robust to noise. Gács and Gál [Gal91] proved a lower bound on the number of gates in a noise-resistant circuit. These works, as well as others, have more of a computational complexity theory flavour than the FRAM model, and treat computational models different from the FRAM model.

1.6 Organization 

The paper is organized as follows. In Section 2 some definitions and preliminaries are given. In Section 3 the randomized selection algorithm is discussed, followed by a discussion of the deterministic selection algorithm in Section 4. The discussion of the stack and recursion implementation is treated independently and deferred to Section 5. A discussion on the application of the resilient selection algorithm to resilient $k$-d trees is in Section 6. Finally, the in-place quicksort sorting algorithm is shown in Section 7.

2 Preliminaries 
2.1 Definitions 

Let $X$ be an array of size $n$ of elements taken from a totally ordered set. Let $X^0$ denote the state of $X$ at the beginning of the execution of an algorithm $A$ executed on $X$. Let $\alpha \le \delta$ be the number of corruptions that occurred during such an execution.
Definition 1. Let $X$ be an array and let $e$ be an element. The rank of $e$ in $X$ is defined as $\mathrm{rank}_X(e) = |\{i : X[i] \le e\}|$. The $\alpha$-rank of $k$ in $X$ is defined as $\alpha\text{-rank}_X(k) = \{e : \mathrm{rank}_X(e) \in [k - \alpha, k + \alpha]\}$.

Notice that the $\alpha$-rank of $k$ in $X$ is an interval containing the elements whose rank in $X$ is not smaller than $k - \alpha$ and not larger than $k + \alpha$. In particular, if $\alpha > n$, this interval is equal to $[-\infty, \infty]$. Moreover, if $\alpha = 0$, then this interval is equal to the $k$-order statistic, and thus coincides with the non-faulty definition.

Definition 2. A resilient $k$-selection algorithm is an algorithm that is given an array $X$ of size $n$ and an integer $k$, and returns an element $e \in \alpha\text{-rank}_{X^0}(k)$, where $\alpha \le \delta$ is the number of faults that occurred during the execution of the algorithm.

Notice that if $\alpha = 0$, then this definition coincides with the common non-faulty definition. That is, if no faults occur during an execution of a resilient selection algorithm, it should locate the exact $k$-order statistic. Moreover, if $\alpha > 0$, no algorithm can be guaranteed to return the exact $k$-order statistic, due to corruptions that can happen at the beginning of the execution. Notice also that because the algorithm cannot distinguish between corrupted and uncorrupted memory cells, it may return an element which was not present in the array at the beginning of the execution.



2.2 Basic Procedures 

Lemma 1. There exists a resilient ranking procedure with time complexity $O(n)$, that is given an array $X$ of size $n$ and an element $e$, and returns an integer $k$ such that $e \in \alpha\text{-rank}_{X^0}(k)$.

Proof. A resilient ranking procedure can be implemented by scanning $X$ while counting the number of elements smaller than or equal to $e$, denoted by $k$. If $\alpha = 0$, then $k = \mathrm{rank}_X(e)$. If $\alpha > 0$, then $e \in \alpha\text{-rank}_{X^0}(k)$, because each corruption can change at most one memory cell, changing the rank of $e$ in $X$ by at most 1. □
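A minimal Python sketch of the ranking procedure of Lemma 1, together with a check of the $\alpha$-rank condition of Definition 1. The function names are hypothetical, Python memory is of course not faulty, and a corruption is simulated by overwriting one list entry before the scan.

```python
def resilient_rank(X, e):
    """Lemma 1 sketch: one left-to-right scan counting elements <= e.

    In the FRAM model the counter k would be kept in a reliable cell;
    here it is an ordinary Python variable.
    """
    k = 0
    for x in X:
        if x <= e:
            k += 1
    return k


def in_alpha_rank(X0, e, k, alpha):
    """True iff e is in alpha-rank_{X0}(k), i.e. |rank_{X0}(e) - k| <= alpha."""
    return abs(resilient_rank(X0, e) - k) <= alpha


X0 = [5, 1, 9, 3, 7]
X = list(X0)
X[2] = 0                    # one simulated corruption before the scan
k = resilient_rank(X, 5)    # counts 5, 1, 0, 3
print(k, in_alpha_rank(X0, 5, k, alpha=1))   # 4 True
```

With one corruption the computed count is off by exactly one from $\mathrm{rank}_{X^0}(5) = 3$, which is the slack that $\alpha = 1$ allows.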

Lemma 2. There exists a resilient partition procedure with time complexity $O(n)$ and space complexity $O(1)$, that is given an array $X$ of size $n$ and an element $e$, reorders $X$ such that the uncorrupted elements smaller (larger) than $e$ are placed before (after) $e$, and returns an integer $k$ such that $e \in \alpha\text{-rank}_{X^0}(k)$.

Proof. A resilient partition procedure can be implemented by scanning $X$ while counting the number of elements smaller than or equal to $e$, denoted by $k$, such that whenever an element smaller than $e$ is encountered it is swapped with the element at position $k + 1$. □
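The partition procedure can be sketched in the same way, as a Lomuto-style in-place scan. One small assumption of this sketch: it swaps every element smaller than or equal to $e$ (not only the smaller ones), so that the returned count always equals the length of the prefix holding those elements. The function name is hypothetical.

```python
def resilient_partition(X, e):
    """Lemma 2 sketch: in-place, O(1)-extra-space scan around the value e.

    Every element <= e is swapped to the front; the returned k is the number
    of such elements, so without faults X[:k] <= e < X[k:] afterwards.
    """
    k = 0
    for i in range(len(X)):
        if X[i] <= e:
            X[k], X[i] = X[i], X[k]
            k += 1
    return k


X = [7, 2, 9, 4, 5, 1]
k = resilient_partition(X, 5)
print(k, X)   # 4 [2, 4, 5, 1, 9, 7]
```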

Notice that both procedures compute an integer $k$ such that $e \in \alpha\text{-rank}_{X^0}(k)$. Let $\mathrm{rank}'_X(e)$ denote the value $k$ computed by either procedure; whenever this notation is used, it will be clear from the context which procedure is meant. Notice that if $\alpha = 0$, then $\mathrm{rank}'_X(e) = \mathrm{rank}_X(e)$.
3 Randomized Resilient Selection Algorithm

As a starter, consider the following randomized resilient selection algorithm, denoted by Randomized-Select. The algorithm is an adaptation of the randomized non-resilient selection algorithm by Hoare [Hoa61], with the following modification. The algorithm maintains an interval $[lb, ub]$, where $lb$ ($ub$) is a lower (upper) bound. When the algorithm queries the array $X$ at index $i$, the value $x$ is chosen to be $x = \min(\max(X[i], lb), ub)$. This guarantees that even a faulty value is within the bounds.

All variables (i.e., $l, r, lb, ub, x_p, p, k$) are stored using reliable memory cells.



Algorithm 1: Randomized-Select(X, k)

    l ← 1, r ← n, lb ← −∞, ub ← ∞
    repeat
        x_p ← random element from X
        x_p ← min(max(x_p, lb), ub)
        partition X around x_p   # using the algorithm from Lemma 2
        let p denote rank'_X(x_p)
        if p = k then
            return x_p
        else if p > k then
            r ← p − 1, ub ← x_p
        else if p < k then
            k ← k − p, l ← p + 1, lb ← x_p
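A minimal Python sketch of Randomized-Select, assuming a fault-free Python list (corruptions are not simulated) and 0-based inclusive indices `l` and `r`. One deliberate deviation from Algorithm 1: instead of a single rank $p$, the sketch uses a three-way partition that reports how many elements are smaller than $x_p$ and how many are at most $x_p$, so that arrays with repeated values still shrink at every iteration. The helper names are hypothetical.

```python
import math
import random


def clamp(x, lb, ub):
    """The value actually used by the algorithm: min(max(x, lb), ub)."""
    return min(max(x, lb), ub)


def partition3(X, l, r, pivot):
    """Three-way in-place partition of X[l..r] (inclusive) around pivot.

    Returns (less, leq): the number of elements < pivot and <= pivot.
    """
    lt, i, gt = l, l, r
    while i <= gt:
        if X[i] < pivot:
            X[lt], X[i] = X[i], X[lt]
            lt += 1
            i += 1
        elif X[i] > pivot:
            X[gt], X[i] = X[i], X[gt]
            gt -= 1
        else:
            i += 1
    return lt - l, gt + 1 - l


def randomized_select(X, k, rng=random):
    """Return the k-th smallest (1-based) element of X, reordering X in place.

    l, r, lb, ub, x_p and k are the variables the text keeps in reliable
    cells; Python memory is not faulty, so this only shows the control flow.
    """
    if not 1 <= k <= len(X):
        raise ValueError("k out of range")
    l, r = 0, len(X) - 1
    lb, ub = -math.inf, math.inf
    while True:
        x_p = clamp(X[rng.randint(l, r)], lb, ub)
        less, leq = partition3(X, l, r, x_p)
        if less < k <= leq:          # x_p has the requested rank
            return x_p
        if k <= less:                # answer lies left of the pivot
            r, ub = l + less - 1, x_p
        else:                        # answer lies right of the pivot
            l, k, lb = l + leq, k - leq, x_p


print(randomized_select([9, 4, 7, 1, 4], 3, random.Random(0)))   # 4
```

Without faults the clamp never changes a value, because every element of the current sub-array already lies in $[lb, ub]$; in the FRAM model it is what keeps a corrupted pivot inside the bounds used in the correctness proof below.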



Theorem 5. There exists a randomized in-place resilient selection algorithm with expected time complexity $O(n + \alpha)$.

Proof. Correctness is proven by induction on the size of the array. The base case of size 1 is obvious. For the induction step, assume that for arrays of size smaller than $n$ the algorithm returns an element $e \in \alpha\text{-rank}_{X^0}(k)$. Consider an execution of the algorithm on an array of size $n$. Let $\alpha_1$ denote the number of corruptions that occurred during the first iteration, and let $\alpha'$ denote the number of corruptions that occurred during the rest of the execution ($\alpha = \alpha_1 + \alpha'$). During the first iteration of the algorithm, if $p = k$, then $e = x_p$ is returned and correctness follows from the definition of the resilient partition procedure, and from the fact that $x_p$ is maintained in reliable memory. Otherwise, assume without loss of generality that $p < k$; the case where $p > k$ is symmetric.

The second iteration considers a sub-array $X' = X[p+1, r]$ of size $n' < n$, with the order statistic updated to $k - p$. Therefore, by the induction hypothesis, $e \in \alpha'\text{-rank}_{X'}(k - p)$. It is guaranteed that $e \ge x_p$, because $e$ is taken to be $\min(\max(e, lb), ub)$ with $lb = x_p$. Therefore, $e$ is at least as large as all the uncorrupted elements in $X[1 : p]$. Each corruption that occurred during the first iteration can change the rank of $e$ by at most 1, therefore $e \in (\alpha' + \alpha_1)\text{-rank}_{X^0}(k) = \alpha\text{-rank}_{X^0}(k)$. Notice that the above proves that the algorithm is correct with probability 1.

With regard to the expected time complexity, let $t$ denote the number of iterations the algorithm performs. If there are no faults (i.e., $\alpha = 0$), then the probability of choosing a pivot $x_p$ such that $\mathrm{rank}_X(x_p) \in [n/4, 3n/4]$ is $1/2$. However, there are two types of possible corruptions. The first type is corruptions of elements that are not used as pivot elements. The second type is corruptions of elements that are used as pivot elements. Let $\alpha'$ ($\alpha''$) denote the number of corruptions of the first (second) type.

Consider corruptions of the first type. Let $i_0, \ldots, i_t$ be indices of iterations such that $i_0$ is the first iteration, $i_t$ is the last iteration, and for every $j \ge 0$, $i_{j+1}$ is the first iteration after $i_j$ such that $n_{i_{j+1}} \le \frac{3}{4} n_{i_j} + \alpha_{i_j}$, where $n_i$ denotes the size of the sub-array at the beginning of the $i$-th iteration and $\alpha_{i_j}$ denotes the number of corruptions that occurred between the $i_j$-th iteration and the $(i_{j+1} - 1)$-th iteration. Unrolling this recurrence gives

$$n_{i_j} \le \left(\frac{3}{4}\right)^{j} n + \sum_{k=0}^{j-1} \left(\frac{3}{4}\right)^{j-k-1} \alpha_{i_k}.$$

Let $Y_j$ denote the number of iterations between the $i_j$-th iteration and the $(i_{j+1} - 1)$-th iteration (i.e., $Y_j = i_{j+1} - i_j$). $Y_j$ is a random variable with a geometric distribution, and $\mathbb{E}(Y_j) \le 2$, by a similar reasoning as in the non-faulty case. Notice that $\mathbb{E}(Y_j) \le 2$ even when conditioned on earlier iterations. It follows that if there are only corruptions of the first type, then the running time is bounded by $\sum_{j=0}^{\infty} O(n_{i_j})\, Y_j$.

Consider corruptions of the second type. For a sub-array of size $n'$, the probability that the adversary corrupts the pivot element using a single corruption is at most $1/n'$, because the adversary cannot see the random coins used by the algorithm. A corrupted pivot can result in up to $O(n')$ extra work. Therefore, the expected cost of a corrupted pivot is $O(1)$. To conclude, the expected time complexity is as follows³:



[T{n)] < E 



-f- 0{a") < E 



3=0 



0{a") 



J=0 



- \ n . ^ 

fc=0 



< 



3=0 



Si 



3 oo J-l 
j=0 fe=0 



Y, 



j-k-i 



+ Oia") 



+ 0{a") 



³ For simplicity, for $j > t$, $Y_j$ and $n_{i_j}$ are defined to be 0.
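Notice that the expectation of $Y_j$ is at most 2, even when conditioned on $\alpha_{i_k}$ for $k < j$ (i.e., $\mathbb{E}[Y_j \mid \alpha_{i_k}] \le 2$ for $k < j$). Therefore, using total expectation, it follows that, for $k < j$:

$$\mathbb{E}[\alpha_{i_k} Y_j] = \sum_{a} a \Pr[\alpha_{i_k} = a]\, \mathbb{E}[Y_j \mid \alpha_{i_k} = a] \le 2 \sum_{a} a \Pr[\alpha_{i_k} = a] = 2\,\mathbb{E}[\alpha_{i_k}].$$

Therefore,

$$\mathbb{E}[T(n)] \le 2n \sum_{j=0}^{\infty} O\left(\left(\tfrac{3}{4}\right)^{j}\right) + \sum_{j=0}^{\infty} \sum_{k=0}^{j-1} O\left(\left(\tfrac{3}{4}\right)^{j-k-1}\right) 2\,\mathbb{E}[\alpha_{i_k}] + O(\alpha'') = O(n) + \sum_{j=0}^{\infty} O(\mathbb{E}[\alpha_{i_j}]) + O(\alpha'') = O(n) + O(\alpha') + O(\alpha'') = O(n + \alpha).$$

□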
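The step from the two sums to $O(n) + O(\alpha')$ exchanges the order of summation; spelled out as a worked step (added for readability, using only $\mathbb{E}[Y_j] \le 2$ and the geometric series $\sum_{m \ge 0} (3/4)^m = 4$), the two terms are bounded as follows, and the second one is $O(\alpha')$ because different $\alpha_{i_k}$ count corruptions in disjoint intervals of the execution:

$$2n \sum_{j=0}^{\infty} \left(\tfrac{3}{4}\right)^{j} = 8n, \qquad \sum_{j=0}^{\infty} \sum_{k=0}^{j-1} \left(\tfrac{3}{4}\right)^{j-k-1} 2\,\mathbb{E}[\alpha_{i_k}] = 2 \sum_{k=0}^{\infty} \mathbb{E}[\alpha_{i_k}] \sum_{m=0}^{\infty} \left(\tfrac{3}{4}\right)^{m} = 8 \sum_{k=0}^{\infty} \mathbb{E}[\alpha_{i_k}].$$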



4 Deterministic Resilient Selection Algorithm 

The following deterministic resilient selection algorithm is similar in nature to the non-resilient algorithm by Blum, Floyd, Pratt, Rivest, and Tarjan [BFP+73], but several major modifications are introduced in order to make it resilient. The algorithm is presented in a recursive form, but the recursion is implemented in a very specific way, as explained in Section 5.

In the non-faulty RAM model, the recursion stack needs to reliably store the 
local variables, as well as the frame pointer and the program counter. Corruptions 
of this data can cause the algorithm to behave unexpectedly, and in general 
the recursion stack cannot fit in reliable memory. Therefore, a special recursion 
implementation is needed. 

Generally, a recursive computation can be thought of as a traversal of a recursion tree $T$, where the computation begins at the root. Each internal node $u \in T$ performs several recursive calls, which can be partitioned into two types: the first type and the second type. Each node performs at least one call of each type, and the calls may be interleaved. The goal of each node $u$ is to locate the $k_u$-th smallest element in the array $X_u$ of size $n_u$. However, due to corruptions, this cannot be guaranteed; therefore a weaker guarantee is used, as explained later.
4.1 Algorithm Description



The root of the recursion tree is a call to Deterministic-Select$(X, k, -\infty, \infty)$. The computation of an inner node $u$ has two phases.

First phase. The goal of the first phase is to find a good pivot, specifically, a pivot whose rank is in the range $[f_u, n_u - f_u]$, where $f_u$ = [^{^f J − [^J − 6⁴. Finding a pivot is done by computing the median of each group of five consecutive elements in $X$, followed by a recursive call of the first type, to compute the median of these medians. The process is repeated until a good pivot is found.
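A non-resilient sketch of how the first phase computes its pivot, assuming distinct values in a plain Python list: medians of consecutive groups of five, then the median of those medians. The recursive call is served here by textbook BFPRT selection rather than by the resilient recursion of Section 5, and no faults, bounds $lb$, $ub$, or acceptance window $[f_u, n_u - f_u]$ are modelled; the function names are hypothetical.

```python
def median_of_group(group):
    """Lower median of a group of at most five elements."""
    return sorted(group)[(len(group) - 1) // 2]


def select(xs, k):
    """Standard (non-resilient) BFPRT selection of the k-th smallest, 1-based."""
    if len(xs) <= 5:
        return sorted(xs)[k - 1]
    pivot = mom_pivot(xs)
    less = [x for x in xs if x < pivot]
    n_equal = sum(1 for x in xs if x == pivot)
    if k <= len(less):
        return select(less, k)
    if k <= len(less) + n_equal:
        return pivot
    greater = [x for x in xs if x > pivot]
    return select(greater, k - len(less) - n_equal)


def mom_pivot(xs):
    """First-phase pivot: median of the medians of consecutive groups of five."""
    medians = [median_of_group(xs[i:i + 5]) for i in range(0, len(xs), 5)]
    return select(medians, (len(medians) + 1) // 2)


xs = list(range(100, 0, -1))
p = mom_pivot(xs)
print(p, sum(x < p for x in xs), sum(x > p for x in xs))   # 48 47 52
```

In the fault-free case such a pivot has at least $3n/10 - 6$ elements on each side, which is why a good pivot is expected to be found quickly when few corruptions occur.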

Second phase. The goal of the second phase is to find a good element, specifically, an element whose rank is in $[k_u \pm n_v]$, where $v$ is a second type child of $u$. This will be shown to be sufficient⁵. This is done by making a recursive call of the second type, which considers only the relevant sub-array with the updated order statistic. Notice that, unlike the non-faulty selection algorithm, here the appropriate sub-array might be padded with more elements, so that the size of the sub-array is $n_u - f_u$. This is important for the recursion implementation, as explained in Section 5. If the value returned from the recursive call is not in the accepted range, the entire computation of the node repeats, starting from the first phase. Once a good element is found, it is returned to the caller.



⁴ The exact choice of $f_u$ (which is a function of $n_u$, the size of the node $u$) relates to the recursion implementation, as explained in Section 5. The idea is to always partition the array at a predetermined ratio, in order to provide more structure to the recursion; this is what makes the recursion size function easily invertible, as mentioned in Section 1.4. Notice that the second term in the definition of $f_u$ could be picked to be $\lfloor \varepsilon \cdot n_u \rfloor$ for any sufficiently small constant $\varepsilon$; a bound on $\varepsilon$ is needed for the running time of the algorithm, as explained in the proof of Theorem 6.

⁵ The exact choice of $[k_u \pm n_v]$ relates to the proof by induction for the correctness of the algorithm. The idea is that as long as fewer than $n_v$ corruptions occurred during the computation of $v$, the rank of the element located by $v$ is guaranteed, by induction, to be within these bounds. See the proof of Lemma 4 and the proof of Lemma [T].
Algorithm 2: Deterministic-Select(X, n, k, lb, ub)

1  # The algorithm uses the recursion implementation from Lemma 3
2  repeat  # Let f denote ⌊3n/10⌋ − ⌊n/11⌋ − 6
3    begin First Phase
4      repeat
5        X_m ← []
6        for i ∈ [1..⌈n/5⌉] do
7          X_m[i] ← median of X[5i − 4, min(5i, n)]
8        x_p ← Deterministic-Select(X_m, ⌈|X_m|/2⌉, lb, ub)
9        partition X around x_p  # using the resilient partition algorithm; let p denote rank_X(x_p)
10     until p ∈ [f, n − f]
11   begin Second Phase
12     if p = k then
13       return e = min{max{x_p, lb}, ub}
14     else if p > k then
15       e ← Deterministic-Select(X[1, n − f], k, lb, x_p)
16     else if p < k then
17       e ← Deterministic-Select(X[f, n], k − f, x_p, ub)
18 until rank_X(e) ∈ [k ± n_v]  # v is a second type child of the node
19 return e = min{max{e, lb}, ub}
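To make the two-phase control flow concrete, here is a Python sketch of the same structure in the fault-free setting only: no corruptions, no bounds lb and ub, no explicit recursion stack, and keys assumed distinct. The function names and the brute-force cutoff of 40 elements are hypothetical, and f is computed as ⌊3n/10⌋ − ⌊n/11⌋ − 6, which is our reading of the comment on line 2. Because nothing is corrupted, the test on line 10 always passes, so the sketch asserts it instead of looping, and each second-phase call receives a sub-array of exactly n − f elements, whichever side is chosen.

```python
def f_offset(n: int) -> int:
    """Assumed f for an array of size n: floor(3n/10) - floor(n/11) - 6."""
    return (3 * n) // 10 - n // 11 - 6


def fault_free_select(xs: list, k: int):
    """Return the k-th smallest (1-based) element of a list of distinct keys."""
    n = len(xs)
    if n <= 40:                                   # hypothetical brute-force cutoff
        return sorted(xs)[k - 1]
    f = f_offset(n)
    # First phase: pivot = median of the medians of consecutive groups of five.
    medians = []
    for i in range(0, n, 5):
        group = sorted(xs[i:i + 5])
        medians.append(group[(len(group) - 1) // 2])
    pivot = fault_free_select(medians, (len(medians) + 1) // 2)
    left = [x for x in xs if x < pivot]
    right = [x for x in xs if x > pivot]
    arr = left + [pivot] + right                  # partition around the pivot
    p = len(left) + 1                             # rank of the pivot
    assert f <= p <= n - f                        # line 10 never fails without faults
    # Second phase: recurse on a sub-array of exactly n - f elements.
    if p == k:
        return pivot
    if p > k:
        return fault_free_select(arr[:n - f], k)  # the k smallest lie in arr[:p - 1]
    return fault_free_select(arr[f:], k - f)      # arr[:f] holds only ranks below k
```

In the faulty setting the assertion would be replaced by the repeat loop of lines 4 to 10, and the rank test of line 18 would guard the second phase.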



Let $\alpha_u$ be the number of corruptions that occurred in $u$'s sub-tree. Each node uses two boundary values, $lb_u$ and $ub_u$, which are used similarly to the bounds used in the randomized resilient algorithm.

The recursive call to a node $u$ is made with the parameters $X_u$, $n_u$, $k_u$, $lb_u$, $ub_u$, and each recursive call returns an element $e_u$. In Section 5, a recursion implementation with the following properties is described.

Lemma 3. There exists a recursion implementation for the resilient deterministic selection algorithm with the following properties:

1. The position of $X_u$, $n_u$, the return value, and the program counter are reliable.

2. If $\alpha_u < n_u$, then $lb_u$, $ub_u$, and $k_u$ are reliable.

3. The time overhead induced by the implementation is $O(n_u)$ per call.

The proof of the lemma is given in Section 5.

This means that these variables are correct, as long as no more than $\delta$ faults occurred.
This means that these variables are correct, as long as fewer than $n_u$ faults occurred.
4.2 Analysis



Let $u$ be a node. Let $V = (v_1, \ldots, v_{|V|})$ be its children; $v_1$ is always a first type node, and $v_{|V|}$ is always a second type node. Every second type child, except $v_{|V|}$, is followed by a first type child; therefore, there cannot be two adjacent second type children (see Fig. 1). Let $\alpha_u$ denote the number of corruptions that occur in $u$'s sub-tree, and let $\alpha_u^{\mathrm{local}}$ denote the number of corruptions that occur only in $u$'s data. Let $\alpha_u^i$ denote the number of corruptions that occur in $u$'s data between the execution of $v_i$ and the execution of $v_{i+1}$ (or until $u$ finishes its computation, if $v_i$ is the last child of $u$), and let $\alpha_u^0$ denote the number of corruptions that occur in $u$'s data before the execution of $v_1$. It follows that $\alpha_u = \alpha_u^{\mathrm{local}} + \sum_{i=1}^{|V|} \alpha_{v_i} = \alpha_u^0 + \sum_{i=1}^{|V|} (\alpha_u^i + \alpha_{v_i})$. Let $X_u^0$ denote the state of $X_u$ at the beginning of $u$'s computation, and let $X_u^i$ denote the state of $X_u$ at the moment of the call to $v_i$.




Fig. 1. A node $u$ with five children, $v_1, \ldots, v_5$, is depicted. The nodes $v_1, v_3, v_4$ are first type children of $u$, while the nodes $v_2, v_5$ are second type children of $u$. The braces show the amortization of corruptions; specifically, $v_2$ pays for $v_1$ and for itself, and $v_3$ pays for itself.



The following lemmas are used to prove the correctness and the running time of Deterministic-Select in Thm. 6.

Lemma 4. If $\alpha_u < n_u$, then $e_u \in \alpha_u\text{-rank}_{X_u^0}(k_u)$.

Proof. The proof is by induction on $n_u$. The base case is $n_u = 1$. In this case, $p = k$, and the claim is correct. For the induction step, note that each corruption of an element in $X_u$ can result in at most one rank error. In contrast, corruptions in auxiliary information can result in more than one rank error per corruption, but this is taken care of, as shown next.
By Lemma 3, the recursion implementation guarantees that if $\alpha_u < n_u$, then $k_u$ is correct. If the return statement in line 13 is used, then the pivot $x_p$ is returned as $e_u$. The test at line 12 guarantees that $\mathrm{rank}_{X_u}(e_u) = k_u$. From the definition of the resilient partition procedure, $e_u \in \alpha_u^{\mathrm{local}}\text{-rank}_{X_u^0}(k_u) \subseteq \alpha_u\text{-rank}_{X_u^0}(k_u)$, as needed.

If the return statement in line 13 is not used, then the return statement in line 19 is used. Let the last child of $u$ be denoted by $v = v_{|V|}$. The test at line 18 guarantees that $\mathrm{rank}_{X_u}(e_u) \in [k_u \pm n_v]$. Notice that the element $e_u$ is the element $e_v$ located by $v$. Therefore, from the definition of the resilient ranking procedure, $e_u \in (\alpha_u^{\mathrm{local}} + n_v)\text{-rank}_{X_u^0}(k_u)$. If $\alpha_v \ge n_v$, then $e_u \in (\alpha_u^{\mathrm{local}} + n_v)\text{-rank}_{X_u^0}(k_u) \subseteq (\alpha_u^{\mathrm{local}} + \alpha_v)\text{-rank}_{X_u^0}(k_u) \subseteq \alpha_u\text{-rank}_{X_u^0}(k_u)$, as needed.

Otherwise (i.e., if $\alpha_v < n_v$), by induction, $e_v \in \alpha_v\text{-rank}_{X_v^0}(k_v)$. Also, the recursion implementation guarantees that $lb_v$, $ub_v$, and $k_v$ are resilient in this case. If $\mathrm{rank}_{X_u}(x_p) > k_u$, then $k_v = k_u$, because both are resilient. Also, since $ub_v$ is resilient, $e_v$ is smaller than all the uncorrupted elements in $X_u$ which are larger than $ub_v$. Therefore, $e_u \in (\alpha_u^{\mathrm{local}} + \alpha_v)\text{-rank}_{X_u^0}(k_u)$, as needed. If $\mathrm{rank}_{X_u}(x_p) < k_u$, then $k_v = k_u - f_u$, because both are resilient. Also, since $lb_v$ is resilient, $e_v$ is larger than all the uncorrupted elements in $X_u$ which are smaller than $lb_v$. Therefore, $e_u \in (\alpha_u^{\mathrm{local}} + \alpha_v)\text{-rank}_{X_u^0}(k_u)$, as needed. □

Lemma 5. Let $v_i = w$ be a first type child of $u$. If $\alpha_w < n_w$, then $3n_u/10 - 3(\alpha_w + \alpha_u^i) - 6 \le \mathrm{rank}_{X_u}(x_p) \le 7n_u/10 + 3(\alpha_w + \alpha_u^i) + 6$, where $x_p = e_w$ is the element returned from $w$ to $u$.

Proof. Since $\alpha_w < n_w$, it follows from Lemma 4 that $e_w \in \alpha_w\text{-rank}_{X_w^0}(k_w)$. Also, the recursion implementation guarantees that $k_w$ is resilient in this case; therefore, $k_w = \lceil n_w/2 \rceil$. There exist at least $3(k_w - \alpha_w - 2) - \alpha_u^i$ elements in $X_u$ which are smaller than $x_p$. This is because each uncorrupted median of five consecutive elements corresponds to at least 3 elements in $X_u$ which are smaller than $x_p$, and each corrupted element, either in $X_w$ or in $X_u$ (in the latter case, one which is not a median of five consecutive elements), can change the rank of $x_p$ by at most 1. A similar argument establishes the second inequality. □

Lemma 6. Let $w = v_i$ be a first type child of $u$. If $v_{i+1}$ is not a second type node, then $\alpha_u^i + \alpha_w = \Omega(n_u)$.

Proof. Since $w$ is not followed by a second type node, $x_p$ did not pass the test at line 10 (i.e., $p \notin [f_u, n_u - f_u]$). There are two cases to consider.

If $\alpha_w \ge n_w = \lceil n_u/5 \rceil$, then, in particular, $\alpha_w = \Omega(n_u)$.

Otherwise, assume that $\alpha_w < n_w$. It will be shown that $\alpha_u^i + \alpha_w \ge n_u/33 = \Omega(n_u)$. Assume, for contradiction, that this is not the case, i.e., that $3(\alpha_u^i + \alpha_w) < n_u/11$. It follows from Lemma 5 that $\mathrm{rank}_{X_u}(x_p) \in [3n_u/10 - 3(\alpha_u^i + \alpha_w) - 6,\ 7n_u/10 + 3(\alpha_u^i + \alpha_w) + 6] \subseteq [3n_u/10 - n_u/11 - 6,\ 7n_u/10 + n_u/11 + 6]$. This contradicts the assumption that $x_p$ did not pass the test at line 10 (i.e., that $\mathrm{rank}_{X_u}(x_p) \notin [3n_u/10 - n_u/11 - 6,\ 7n_u/10 + n_u/11 + 6]$). □

Lemma 7. Let $w = v_i$ be a second type child of $u$. If $w$ is not the last child of $u$, then $\alpha_u^i + \alpha_w = \Omega(n_u)$.
Proof. Since $w$ is not the last child of $u$, $e_w$ did not pass the test at line 18 (i.e., $\mathrm{rank}_{X_u}(e_w) \notin [k_u \pm n_w]$). Again, there are two cases to consider.

If $\alpha_w \ge n_w = n_u - f_u$, then, in particular, $\alpha_w = \Omega(n_u)$.

Otherwise, if $\alpha_w < n_w$, then, by Lemma 4, $e_w \in \alpha_w\text{-rank}_{X_w^0}(k_w)$. Moreover, $k_w$ corresponds to $k_u$ (it equals $k_u$, or $k_u - f_u$ when the sub-array $X[f, n]$ is selected), and each corruption in $X_u$ can cause the rank of $e_w$ to change by at most 1. Therefore, $\mathrm{rank}_{X_u}(e_w) \in [k_u \pm (\alpha_u^i + \alpha_w)]$. However, $e_w$ did not pass the test at line 18; therefore, $\alpha_u^i + \alpha_w > n_w = \Omega(n_u)$. □

Theorem 6. Deterministic-Select is a deterministic resilient selection algorithm with time complexity $O(n + \alpha)$.

Proof. First, Deterministic-Select is shown to be resilient. Let $u$ be the root of the recursion tree, $T$. If $\delta < n = n_u$, then by Lemma 4, $e \in \alpha\text{-rank}_{X^0}(k)$, as needed. Otherwise, if $\delta \ge n$, then there are two cases to consider. If $\alpha < n$, then by Lemma 4, $e \in \alpha\text{-rank}_{X^0}(k)$, as before. Otherwise, if $\alpha \ge n$, then by definition, $[-\infty, \infty] = n\text{-rank}_{X^0}(k) = \alpha\text{-rank}_{X^0}(k)$. Therefore, every element $e$ satisfies $e \in \alpha\text{-rank}_{X^0}(k)$. In particular, the element returned is correct.

With regard to the time complexity, consider a non-faulty execution (i.e., $\alpha = 0$). The time complexity $T(n) = T(\lceil n/5 \rceil) + T(\lceil 7n/10 \rceil + \lceil n/11 \rceil + 6) + O(n) = O(n)$ follows, because $1/5 + 7/10 + 1/11 = 109/110 < 1$.

If $\alpha > 0$, then there might be some repetitions. Lemma 6 and Lemma 7 show that enough corruptions can be charged for the time spent in those repetitions. In particular, the $\Omega(n_u)$ corruptions that cause a first type child repetition pay for the $O(n_u)$ computation time of the child, and the $\Omega(n_u)$ corruptions that cause a second type child repetition pay for the $O(n_u)$ computation time of the child and for the $O(n_u)$ computation time of the first type child that precedes it. Figure 1 shows a visualization of this amortization. In both cases, there is an $O(1)$ amortized cost per corruption. Therefore, the overall time complexity is $O(n + \alpha)$. □
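As a numerical sanity check of the fault-free recurrence above (not a substitute for the proof), the Python sketch below evaluates it exactly, taking the $O(n)$ term to be $n$ and using a hypothetical base case $T(n) = n$ for $n \le 200$. Because $1/5 + 7/10 + 1/11 = 109/110$, an induction of the form $T(n) \le 220(n - 9)$ goes through for this base case, so the ratio $T(n)/n$ should stay below 220 however large $n$ gets.

```python
from functools import lru_cache

BASE = 200  # hypothetical cutoff: T(n) = n for n <= BASE


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


@lru_cache(maxsize=None)
def T(n: int) -> int:
    """Fault-free recurrence from Theorem 6, with the O(n) term taken as n."""
    if n <= BASE:
        return n
    first = ceil_div(n, 5)                              # median-of-medians call
    second = ceil_div(7 * n, 10) + ceil_div(n, 11) + 6  # second-phase call
    return T(first) + T(second) + n


for n in (10**3, 10**4, 10**5, 10**6):
    print(n, T(n) / n)  # the ratio stays bounded as n grows
```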

Theorem 7. There exists a deterministic resilient selection algorithm with time complexity $O(n)$.

Proof. The algorithm Deterministic-Select can be modified to achieve worst-case time complexity $O(n)$. The algorithm maintains a counter $c$, initialized to 0, which is a lower bound on the number of corruptions that occurred. Notice that $c$ can be maintained in a reliable memory cell.

The proof of Lemma 6 shows that if a first phase repetition occurred, it must be due to at least $\lfloor n_u/33 \rfloor$ corruptions, where $u$ is the current node. Therefore, in this case, the counter is incremented by $\lfloor n_u/33 \rfloor$. The proof of Lemma 7 shows that if a second phase repetition occurred, it must be due to at least $n_v$ corruptions, where $v$ is the second type child of the current node that caused the repetition. Therefore, in this case, the counter is incremented by $n_v$. If the counter is equal to or larger than $n$, the algorithm halts with an arbitrary element.
The modified algorithm is correct, because the counter is a lower bound on the number of corruptions: if $c \ge n$, then $\alpha \ge n$, and therefore any element is in the $\alpha$-rank of $k$. With regard to the time complexity, notice that the counter is also an upper bound, up to a multiplicative constant, on the amount of extra work performed due to corruptions. Therefore, as long as $c < 2n$, which is always the case, the total work is $O(n)$. □
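The bookkeeping that turns Theorem 6 into Theorem 7 is small enough to sketch. The class and method names below are hypothetical; the proof only specifies a counter $c$ kept in a reliable cell, the two charges $\lfloor n_u/33 \rfloor$ and $n_v$, and the rule that the algorithm may return an arbitrary element once $c \ge n$.

```python
class CorruptionCounter:
    """Hypothetical sketch of the reliable counter c from the proof of Theorem 7."""

    def __init__(self, n: int) -> None:
        self.n = n      # size of the original input
        self.c = 0      # lower bound on the number of corruptions so far

    def charge_first_phase(self, n_u: int) -> None:
        # a repeated first phase at node u certifies >= floor(n_u / 33) faults
        self.c += n_u // 33

    def charge_second_phase(self, n_v: int) -> None:
        # a repeated second phase caused by child v certifies >= n_v faults
        self.c += n_v

    def may_return_anything(self) -> bool:
        # c >= n implies alpha >= n, so every element is alpha-rank correct
        return self.c >= self.n
```

In the modified algorithm, `may_return_anything` would be consulted after each charge, so the extra work never exceeds a constant times $c$.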

5 Recursion Implementation 

In this section, an abstract recursion stack for Deterministic-Select is developed. The data structures used by this abstract stack are described, followed by the implementation of the operations on it. This leads to the proof of Lemma 3 at the end of this section.

5.1 Data Structures 

Two stacks, one reliable and the other faulty, together with a constant number of reliable memory cells, are used to implement the recursion for the algorithm Deterministic-Select. An execution path in the recursion tree, $T$, starts from the root and ends at the current node. In each stack, the entire execution path is stored in a contiguous region in memory, where the root is at the beginning and the current node is at the end. The stacks are depicted schematically in Fig. 2.

Reliable Stack The reliable stack stores only 9 bits of information per node. The height of $T$ is $O(\log n)$; therefore, the reliable stack can be stored in a constant number of reliable memory cells. For each inner node $u \in T$, the reliable stack stores 1 bit to distinguish between a first type child and a second type child. Let $\rho_x^y$ denote the remainder of the division of $x$ by $y$. For a node of the first type, $\rho_{n_u}^{5}$ is stored. For a node of the second type, $\rho_{n_u}^{10/3}$ and $\rho_{n_u}^{11}$ are stored. Notice that the $O(1)$ reliable memory cells are used down to the bit level.

Faulty Stack The faulty stack stores $O(n_u)$ words of information per node. For each node $u \in T$, the faulty stack stores the elements of $X_u$, as well as $k_u$, $lb_u$, and $ub_u$. The elements of $X_u$ are stored using 1 copy per element, while $k_u$, $lb_u$, and $ub_u$ are stored using $2n_u + 1$ copies per variable.
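The text does not spell out how the majority of the $2n_u + 1$ copies is computed when they are read back. One option that needs only $O(1)$ reliable words (a candidate and a counter) is the Boyer–Moore majority vote, sketched below in Python; this choice and the function names are assumptions. If at most $n_u$ of the $2n_u + 1$ copies are corrupted, the correct value still holds a strict majority, and the vote returns it.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def write_copies(value: T, n_u: int) -> List[T]:
    """Store a variable of node u as 2 * n_u + 1 copies in faulty memory."""
    return [value] * (2 * n_u + 1)


def majority_read(copies: Sequence[T]) -> T:
    """Boyer-Moore majority vote: O(1) extra words (candidate and count)."""
    candidate, count = copies[0], 0
    for value in copies:
        if count == 0:
            candidate, count = value, 1
        elif value == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```

The single pass over the copies is what contributes the $O(n_u)$ term to the per-call overhead in Lemma 3.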

Global Variables Each one of the following global variables is stored using a 
reliable memory cell: 

— The current array size 

— The reliable stack's frame pointer 

— The faulty stack's frame pointer 

— The program counter 
— The return value



Notice that at any given moment in an execution, only one value per global variable needs to be stored.






Fig. 2. The stacks used by the recursion implementation are depicted. The reliable stack is at the top and the faulty stack is at the bottom. The execution path is composed of the root, $w$, its first type child, $v$, and $v$'s second type child, $u$. The figure shows the situation when $u$ begins its computation. For each node, the reliable stack stores the traversal direction to its child (drawn as a pointed arrow), as well as the remainders, $\rho^{5}$, or $\rho^{10/3}$ and $\rho^{11}$, while the faulty stack stores the sub-array $X$, as well as $2n + 1$ copies of $lb$, $ub$, and $k$, where $n$ is the size of the node's sub-array. The frame pointers are also shown.



5.2 Operations 

Two operations are implemented by the recursion implementation. A push operation corresponds to a recursive call, and a pop operation corresponds to returning from a recursive call.



Push When a node $u$ calls its child $v$, the following is done. Whether $v$ is a first type child or a second type child of $u$ is written to the reliable stack, together with the relevant remainders (i.e., $\rho_{n_u}^{5}$, or $\rho_{n_u}^{10/3}$ and $\rho_{n_u}^{11}$), and the reliable stack's frame pointer is incremented by 9 bits. Then, the relevant sub-array is pushed to the faulty stack, followed by the values $lb_v$, $ub_v$, and $k_v$. If $v$ is a first type child, then the current array size is updated to $n_v = \lceil n_u/5 \rceil$. If $v$ is a second type child, then it is updated to $n_v = n_u - f_u$. The faulty stack's frame pointer is updated accordingly, and the program counter is set to line 1. Then, the computation continues to $v$.
Pop When $v$ finishes its computation, the following is done. First, the reliable stack's frame pointer is decremented by 9 bits, and the information of whether $v$ is a first type or a second type child of $u$ is read, as well as the stored remainders (i.e., $\rho_{n_u}^{5}$, or $\rho_{n_u}^{10/3}$ and $\rho_{n_u}^{11}$).

If $v$ is a first type child, then the current array size is updated from $n_v$ to $n_u = 5(n_v - 1) + \rho_{n_u}^{5}$. If $v$ is a second type child, then it is updated to $n_u = \frac{110}{87}\left(n_v - \frac{\rho_{n_u}^{10/3}}{10/3} + \frac{\rho_{n_u}^{11}}{11} - 6\right)$. Notice that this function is the inverse of the function $n_u \mapsto n_u - f_u$, which is used to update the array size when calling a second type child, as explained before. The faulty stack's frame pointer is decremented by the size of $v$'s frame, $n_v + 3(2n_v + 1)$ words.

The $2n_u + 1$ copies of $lb_u$, $ub_u$, and $k_u$ are read, and the majority of each variable's copies is computed, stored in reliable memory, and used as the value of $lb_u$, $ub_u$, and $k_u$, respectively. Then, the computation returns to $u$, either to line 8 or to line 18, depending on the type of $v$.
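The Python sketch below checks that the size updates under Push and Pop are exactly invertible from 9 stored bits. It rests on assumptions the text does not state explicitly: $f_u = \lfloor 3n_u/10 \rfloor - \lfloor n_u/11 \rfloor - 6$ (which is where the factor $110/87$ comes from, since $n_u - f_u \approx 87n_u/110 + 6$); $\rho_{n_u}^{10/3}$ is represented by the integer $3\rho_{n_u}^{10/3} = 3n_u \bmod 10$ (10 values, 4 bits); $\rho_{n_u}^{11}$ takes 11 values (4 bits); and, for first type children, the remainder is taken in $\{1, \ldots, 5\}$, i.e., $r = n_u - 5(\lceil n_u/5 \rceil - 1)$, which is one way to make $5(n_v - 1) + r$ exact when 5 divides $n_u$. The bit layout and function names are hypothetical.

```python
from typing import Tuple

FRAME_BITS = 9  # 1 type bit + up to two 4-bit remainder fields


def f(n: int) -> int:
    """Assumed f_u for a node of size n: floor(3n/10) - floor(n/11) - 6."""
    return (3 * n) // 10 - n // 11 - 6


def push_frame(n: int, second_type: bool) -> Tuple[int, int]:
    """Return (child size, 9-bit frame) when a node of size n calls a child."""
    if not second_type:
        child = -(-n // 5)                # ceil(n / 5)
        r = n - 5 * (child - 1)           # remainder taken in {1, ..., 5}
        frame = (r - 1) << 1              # type bit 0, r - 1 in bits 1..3
    else:
        child = n - f(n)
        a = (3 * n) % 10                  # 3 * rho_n^{10/3}, in {0, ..., 9}
        b = n % 11                        # rho_n^{11}, in {0, ..., 10}
        frame = 1 | (a << 1) | (b << 5)   # type bit 1, a in bits 1..4, b in bits 5..8
    assert 0 <= frame < (1 << FRAME_BITS)
    return child, frame


def pop_frame(child: int, frame: int) -> int:
    """Recover the parent's size from the child's size and the stored frame."""
    if frame & 1 == 0:                    # first type child
        r = (frame >> 1) + 1
        return 5 * (child - 1) + r
    a = (frame >> 1) & 0xF
    b = (frame >> 5) & 0xF
    # n = (110/87) * (child - a/10 + b/11 - 6), computed with integers only
    numerator = 110 * (child - 6) - 11 * a + 10 * b
    assert numerator % 87 == 0
    return numerator // 87
```

For example, `push_frame(100, False)` yields a child of size 20 and the frame `8`, and `pop_frame(20, 8)` gives back 100.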

5.3 Proof of Lemma 3

Proof. The frame pointers, the return value, and the program counter were shown to be reliable, as well as the location of the array $X_u$ and its size $n_u$. The values $lb_u$, $ub_u$, and $k_u$ are stored using $2n_u + 1$ copies each; therefore, if $\alpha_u < n_u$, then the majority of each variable's copies is still correct, and these parameters are reliable. The time overhead induced by the frame pointers, the return value, the program counter, the location of the array, and its size $n_u$ is constant. The time overhead induced by $lb_u$, $ub_u$, and $k_u$ is $O(n_u)$. Therefore, the time overhead of the recursion implementation is $O(n_u)$. □

6 Resilient k-d Trees 

Gieseke et al. [GMV10] developed a resilient k-d tree, where $k$ denotes the dimension (this $k$ is not related to the $k$ in the selection algorithm). As is the case with non-resilient k-d trees, the construction involves repeatedly partitioning the points by the median. For example, if $k = 2$, then at even-depth nodes the points are partitioned by the x-coordinate median, and at odd-depth nodes they are partitioned by the y-coordinate median. In a resilient k-d tree, the partitioning ends at the leaves, which contain $b\delta = O(\delta)$ points each, where $b$ is a parameter.
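For readers less familiar with k-d trees, here is a minimal non-resilient Python sketch of the construction just described: the splitting coordinate cycles with the depth, and recursion stops at buckets of at most `leaf_size` points, a hypothetical parameter playing the role of $b\delta$. Sorting is used only for brevity; a linear-time selection at each level is what makes the whole construction $O(n \log n)$, which is the point of the discussion that follows.

```python
from typing import List, Tuple, Union

Point = Tuple[float, ...]
Node = Union[List[Point], tuple]  # a leaf bucket or (axis, split, left, right)


def build_kd(points: List[Point], leaf_size: int, depth: int = 0) -> Node:
    """Non-resilient k-d tree: split at the median of coordinate depth mod k."""
    assert leaf_size >= 1
    if len(points) <= leaf_size:
        return list(points)                      # leaf bucket
    axis = depth % len(points[0])                # for k = 2: x at even depth, y at odd
    pts = sorted(points, key=lambda p: p[axis])  # a selection routine would suffice
    mid = len(pts) // 2
    return (axis, pts[mid][axis],
            build_kd(pts[:mid], leaf_size, depth + 1),
            build_kd(pts[mid:], leaf_size, depth + 1))
```

Every point in the left sub-tree has its `axis` coordinate at most the stored split value, and every point in the right sub-tree at least that value.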

Gieseke et al. developed a randomized resilient selection algorithm, which is somewhat different from the randomized resilient selection algorithm developed in this work. Both algorithms achieve the same expected time complexity, $O(n + \alpha)$. Using these algorithms to build a resilient k-d tree results in $O(n \log n + \delta)$ expected time complexity. However, the selection algorithm developed here guarantees that the element returned has rank between $k - \alpha$ and $k + \alpha$ in the input array, while the algorithm developed in [GMV10] only guarantees a rank between $k - O(\delta)$ and $k + O(\delta)$. This difference has no asymptotic consequences for the height of the resulting k-d tree.

For a deterministic k-d tree construction algorithm, Gieseke et al. used the resilient sorting algorithm developed by Finocchi et al. [FGI09a] in order to partition the points around the median. This results in $O(n \log^2 n + \alpha\delta)$ time complexity. By using the deterministic resilient selection algorithm developed here, the time complexity is reduced to $O(n \log n)$, which implies the following theorem.

Theorem 8. There exists a resilient k-d tree which can be constructed in deterministic $O(n \log n)$ time. It supports resilient orthogonal range queries in $O(\sqrt{n\delta} + t)$ time for reporting $t$ points.

7 Resilient Quicksort Algorithms 

The famous quicksort algorithm is based on the idea of selecting a pivot, partitioning the input by it, and recursively sorting each side of the partition. In the FRAM model, the difficulty is in having to maintain the $\omega(1)$ partitioning locations. This is true for both a recursive and an iterative implementation. One natural idea for dealing with this difficulty is to partition the array at the median. For the sake of simplicity, assume that the size of the input is a power of two. However, using a resilient selection algorithm for locating the median in the FRAM model and partitioning around the element returned does not guarantee that the array is split into two parts of equal size: corruptions may occur during the execution of the selection algorithm, so the element returned is only roughly the median. Thus, there is a need to develop a resilient splitting algorithm, which is defined as follows.

Definition 3. A resilient splitting algorithm is an algorithm that is given an array $X$ of size $n$ and an integer $k$, and reorders the array such that any uncorrupted element in $X[1, k]$ is smaller than any uncorrupted element in $X[k, n]$.

In Section 7.1, two inefficient resilient splitting algorithms are shown: one is deterministic and runs in $O(\alpha n)$ worst-case time, and the second is randomized and in-place and runs in $O(\alpha n)$ expected time.

In Section 7.2, two efficient resilient splitting algorithms are shown: one is deterministic and runs in $O(n + \alpha\delta)$ worst-case time, and the second is randomized and in-place and runs in $O(n + \alpha\delta)$ expected time. These efficient algorithms use the inefficient algorithms from Section 7.1.

7.1 Sandboxed Splitting Algorithms 

The basic idea behind the resilient splitting algorithms is to test the rank of the element returned by the selection algorithm, and fix it, as needed. In order to achieve this goal, the notion of sandboxing an algorithm is introduced. The idea is to convert a non-resilient algorithm $A$ with a known bound on its running time and space usage into a resilient algorithm $A'$. However, in order to be able to do this, there must exist a verification procedure which can verify that the output of $A$ is correct, and the algorithm $A$ needs to be non-destructive, a notion which is defined later.

If the size of the input is not a power of two, then careful padding can take place. Since the interest here is in an in-place algorithm, the padding can be done abstractly, by treating each access to an array location which does not exist as $\infty$.

Finocchi, Grandoni and Italiano ([FGI09b], Lemma 4) already considered a similar reduction. However, there is an unfortunate flaw in the proof given there, because it does not take into consideration the following two cases: a corrupted variable that can cause the non-resilient procedure to require a much larger time complexity (maybe even getting stuck in an infinite loop), and memory corruptions that can cause the non-resilient procedure to alter memory cells used by other parts of the system. These problems can be overcome by confining the execution to a predetermined area in memory and having an upper bound on the running time of $A$. The area in memory is referred to as the sandbox. For a problem $P$ and an input $X$, let $P(X)$ denote the set of correct solutions of $P$ on $X$.

Definition 4. Let $A$ be an algorithm for a problem $P$. Assume that an execution of $A$ on an input $X$ can be interrupted at any point in time, and let $X'$ denote the state of the input after such an interruption. $A$ is non-destructive if for any execution of $A$ on any input $X$ and on any set of random coins, and for any interruption with any possible sequence of faults, $P(X) = P(X')$.

Lemma 8. Let $A$ be a non-resilient and non-destructive algorithm solving problem $P$ with time complexity $T_A$ (either worst-case or expected) and space complexity $S_A$. Let $C$ be a resilient verification procedure for $A$ with worst-case time complexity $T_C$ and space complexity $S_C$ which decides the correctness of an execution of $A$. Then there exists a resilient algorithm $A'$ which solves $P$ and has time complexity $O((1 + \alpha)(T_A + T_C))$, which is either worst-case or expected, depending on $T_A$, and space complexity $O(S_A + S_C)$.

Proof. A sandboxed version of $A$, denoted by $A'$, is defined as follows. The algorithm works in rounds. In each round, $A'$ runs a modified version of $A$, as defined next. If the running time of $A$ is worst-case, then, to guarantee that $A$ will not run for too long, $A'$ runs $A$ for no more than $T_A$ steps. If the running time of $A$ is in expectation, then, to guarantee that $A$ will not run for too long, $A'$ runs $A$ for no more than $2T_A$ steps. This is done by counting the number of computational steps that $A$ performs. To guarantee that $A$ will not alter memory cells other than its own, $A'$ runs $A$ confined to a memory region of size $S_A$. The counter which counts the computational steps, as well as the two boundaries of the memory region, are stored in reliable memory cells. After running the modified algorithm $A$, $A'$ calls $C$ to check the correctness of $A$'s computation. If $C$ returns a positive answer, $A'$ halts. Otherwise, a new round begins, but only after the memory sandbox is flushed. That is, immediately after an unsuccessful round ends, all of the working memory is erased, but the input is left as it is for the next round.

The memory sandbox guarantees that the (non-resilient) computation of $A$ does not alter memory cells outside of $A$'s region. $A'$ halts only after the resilient verification procedure $C$ returns a positive answer. Therefore, $A'$ is correct, even in the presence of memory faults.

If the running time of $A$ is worst-case, then each round takes $T_A + T_C$ time. In a non-faulty round, $A$ is correct. By the pigeonhole principle, if there are more than $\alpha$ rounds, at least one of them is non-faulty. Denote the state of the input at the beginning of this non-faulty round by $X'$. $A$ is a (non-resilient) algorithm for $P$; therefore, in this non-faulty round, it computes a correct output $y \in P(X')$. $A$ is non-destructive; therefore, $P(X) = P(X')$. It follows that $y \in P(X)$, i.e., $y$ is correct. Therefore, there are at most $\alpha + 1$ rounds.

If the running time of $A$ is expected, then each round takes $2T_A + T_C$ time. In a non-faulty round, the probability that $A$ halts within $2T_A$ computational steps is at least $1/2$, by Markov's inequality. Therefore, the expected number of rounds is at most $2\alpha + 1$.

The space used by the calls to $A$ and $C$ can be reused; therefore, the space complexity is $O(S_A + S_C)$. □
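The round structure of $A'$ can be sketched in Python, with a generator standing in for $A$ so that each resumption counts as one computational step. This is an illustration under stated assumptions, not the construction itself: Python cannot confine memory, so the "sandbox" here is only the private working copy that $A$ creates in each round (the copying variant of non-destructiveness mentioned in a footnote of this section), and `max_rounds` is a hypothetical safety cap that Lemma 8 does not need. All names are hypothetical.

```python
from collections import Counter
from typing import Callable, Generator, List, Optional

Run = Generator[None, None, List[int]]  # a step-counted execution of A


def sandboxed(make_run: Callable[[], Run],
              verify: Callable[[List[int]], bool],
              step_budget: int,
              max_rounds: int) -> Optional[List[int]]:
    """Repeat rounds: run A for at most step_budget steps, then check with C."""
    for _round in range(max_rounds):
        run = make_run()                 # fresh working state for this round
        result: Optional[List[int]] = None
        try:
            for _step in range(step_budget):
                next(run)                # one resumption = one step of A
        except StopIteration as stop:
            result = stop.value          # A finished within its budget
        run.close()                      # discard whatever the round left behind
        if result is not None and verify(result):
            return result
    return None                          # reached only because of the safety cap


def split_steps(xs: List[int], k: int) -> Run:
    """Hypothetical non-resilient A: partial selection sort on a private copy."""
    work = list(xs)
    for i in range(k):
        m = i
        for j in range(i + 1, len(work)):
            yield                        # count each comparison as one step
            if work[j] < work[m]:
                m = j
        work[i], work[m] = work[m], work[i]
    return work


def is_split(xs: List[int], ys: List[int], k: int) -> bool:
    """Hypothetical verifier C: ys is a permutation of xs that is split at k."""
    if Counter(xs) != Counter(ys):
        return False
    return not ys[:k] or not ys[k:] or max(ys[:k]) <= min(ys[k:])
```

A round that exceeds its budget (for instance, one stuck in an infinite loop because of a corrupted variable) simply ends without a result, and the next round starts from the unchanged input.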

This general notion of sandboxed algorithms can be used for designing resilient splitting algorithms. For the deterministic resilient splitting algorithm, algorithm $A$ is executed by using the non-resilient deterministic selection algorithm to locate the median and partitioning the array around the element returned. The verification procedure $C$ is implemented by testing that each side of the partition has the same size. For the randomized resilient splitting algorithm, algorithm $A$ is executed by using the non-resilient randomized selection algorithm to locate the median. Notice that in the randomized case, $S_A = O(1)$.

Notice that both the non-resilient deterministic selection algorithm and the non-resilient randomized selection algorithm need to be slightly altered in order to be non-destructive. The only operation that these algorithms perform which might alter the input is swapping. The idea is to make these swaps atomic, i.e., to stop the algorithm only after such a swap is fully completed. Notice that the $k$-order statistic of an input array does not depend on the specific permutation of the input array.

Corollary 1. There exists a deterministic resilient splitting algorithm with worst-case time complexity $O(\alpha n)$, and a randomized in-place resilient splitting algorithm with expected time complexity $O(\alpha n)$.

Proof. The proof follows from Lemma 8 and from the discussion above. □

Denote the algorithm from Corollary 1 by Sandboxed-Split. The running time of such an algorithm is rather costly, but it is still useful when considering small arrays. For the resilient splitting algorithm, the idea is to reduce the size of the array, and then execute Sandboxed-Split.

If $A$ is a Monte Carlo algorithm, then $\Pr[y \in P(X')]$ is bounded below by a constant, and because of the non-destructiveness of $A$, the same bound holds for $\Pr[y \in P(X)]$, as needed.

Another way of altering these algorithms to make them non-destructive is by copying the input array to a second, temporary array, performing all of the swaps only on the temporary array, and making sure that the input array is not altered at all by putting it outside the memory sandbox. This solution, however, has a cost in time and space.

7.2 Efficient Splitting Algorithms 

Consider the following generic algorithm, denoted by Generic-Resilient-Split. The algorithm uses either Deterministic-Select or Randomized-Select, denoted here by Generic-Resilient-Select, to locate both the $(k - \delta)$th and the $(k + \delta)$th order statistics. Then, it uses Sandboxed-Split to split the remaining $O(\delta)$ elements.



Algorithm 3: Generic-Resilient-Split(X, k)

1 l ← Generic-Resilient-Select(X[1, n], k − δ)
2 partition X around l
3 if rank_X(l) = k then
4   return
5 r ← Generic-Resilient-Select(X[l, n], k − l + δ + 1)
6 partition X[l, n] around r
7 if rank_X(r) = k then
8   return
9 Sandboxed-Split(X[l, r], k − l)



Lemma 9. Generic-Resilient-Split is a resilient splitting algorithm. 

Proof. After the array is partitioned around $l$, the uncorrupted elements in $X[1, l]$ are smaller than the uncorrupted elements in $X[l, n]$. After the array is partitioned around $r$, the uncorrupted elements in $X[l, r]$ are smaller than the uncorrupted elements in $X[r, n]$. After the call to Sandboxed-Split, the uncorrupted elements in the first $k - l$ positions of $X[l, r]$ are smaller than the uncorrupted elements in its remaining positions, i.e., the uncorrupted elements in $X[l, k]$ are smaller than those in $X[k, r]$. It follows that the uncorrupted elements in $X[1, k]$ are smaller than the uncorrupted elements in $X[k, n]$, as needed. □

The following corollaries follow by substituting Generic-Resilient-Select with either the deterministic or the randomized version. Notice that in both cases, the call to Generic-Resilient-Select takes $O(n + \alpha)$ time (either in expectation or in the worst case, as needed). The size of the sub-array $X[l, r]$ is $O(\delta)$; therefore, the call to Sandboxed-Split takes $O(\alpha\delta)$ time.

Corollary 2. There exists a deterministic resilient splitting algorithm with worst-case running time $O(n + \alpha\delta)$, and a randomized in-place resilient splitting algorithm with expected running time $O(n + \alpha\delta)$.
7.3 Resilient Quicksort Algorithms



Using the generic resilient splitting algorithm as a black box, one can sort resiliently using Generic-Resilient-Quicksort. Notice that this algorithm does not use more than $O(1)$ space, except for the space used by the splitting algorithm.



Algorithm 4: Generic-Resilient-Quicksort(X)

1 for d ∈ [1..log n] do
2   for c ∈ [0..2^(d−1) − 1] do
3     n' ← n / 2^(d−1)
4     X' ← X[n'·c + 1, n'·(c + 1)]  # the array is split in-place
5     Generic-Resilient-Split(X', n'/2)
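The level-by-level loop of Algorithm 4 can be sketched in Python as follows. A fault-free randomized quickselect with three-way partitioning stands in for Generic-Resilient-Split; it is a hypothetical substitute, not the resilient procedure. The point of the sketch is that, besides the splitting routine, the driver keeps only the loop counters, as noted above.

```python
import random
from typing import List


def split(xs: List[int], lo: int, hi: int, k: int) -> None:
    """Fault-free stand-in for Generic-Resilient-Split on xs[lo:hi].

    Afterwards every element of xs[lo:lo + k] is <= every element of xs[lo + k:hi].
    """
    target = lo + k
    while hi - lo > 1:
        pivot = xs[random.randrange(lo, hi)]
        lt, i, gt = lo, lo, hi           # three-way partition of xs[lo:hi]
        while i < gt:
            if xs[i] < pivot:
                xs[lt], xs[i] = xs[i], xs[lt]
                lt += 1
                i += 1
            elif xs[i] > pivot:
                gt -= 1
                xs[gt], xs[i] = xs[i], xs[gt]
            else:
                i += 1
        if target < lt:
            hi = lt                      # split point lies among the smaller items
        elif target >= gt:
            lo = gt                      # split point lies among the larger items
        else:
            return                       # split point falls inside the pivot block


def level_quicksort(xs: List[int]) -> List[int]:
    """Algorithm 4: split every block of width n / 2^(d-1) at its middle."""
    n = len(xs)
    assert n & (n - 1) == 0, "sketch assumes n is a power of two"
    width = n
    while width > 1:                     # d = 1, ..., log2(n)
        for c in range(n // width):      # c = 0, ..., 2^(d-1) - 1
            split(xs, width * c, width * (c + 1), width // 2)
        width //= 2
    return xs
```

After the last level every block has width 1, and since each split placed the smaller half of its block first, the array is sorted.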



Lemma 10. Generic-Resilient-Quicksort is a resilient sorting algorithm. 

Proof. Consider two uncorrupted elements $a$ and $b$ from the input, where $a < b$. There exists some element $p$ which partitions them, at which point $a$ is put before $b$ in the array, and from then onwards their relative order remains the same. □

The following theorem follows. 

Theorem 9. There exists a deterministic resilient sorting algorithm with worst-case running time $O(n \log n + \alpha\delta)$, and a randomized resilient in-place sorting algorithm with expected running time $O(n \log n + \alpha\delta)$.

Proof. The theorem follows from Corollary 2 and Lemma 10. □